Search for: All records

Creators/Authors contains: "Sadayappan, P."

Total Resources: 22

Resource Type
Conference Paper: 19
Conference Proceeding: 0
Dataset: 0
Journal Article: 3
Workshop Report: 0

Availability
Full Text / Resource Available: 20
Citation Only: 2

Scalable parallelization for the solution of phonon Boltzmann Transport Equation

https://doi.org/10.1145/3577193.3593723

Tran, Han D. ; Saurav, Siddharth ; Sadayappan, P. ; Mazumder, Sandip ; Sundar, Hari ( June 2023 , ACM)

The Boltzmann Transport Equation (BTE) for phonons is often used to predict thermal transport at submicron scales in semiconductors. The BTE is a seven-dimensional nonlinear integro-differential equation, which makes it difficult to solve even after linearization under the single relaxation time approximation. Furthermore, parallelization and load balancing are challenging, given the high dimensionality and the variability in the conditioning of the linear systems. This work presents a 'synthetic' scalable parallelization method for solving the BTE on large-scale systems. The method includes cell-based parallelization, combined band+cell-based parallelization, and a batching technique. The essential computational ingredient of cell-based parallelization is a sparse matrix-vector product (SpMV) that can be integrated with an existing linear algebra library such as PETSc. The combined approach enhances the cell-based method by further parallelizing the band dimension to take advantage of low inter-band communication costs. For the batched approach, we developed a batched SpMV that enables multiple linear systems to be solved simultaneously, merging many MPI messages to reduce communication costs and thus maintaining scalability when the grain size becomes very small. We present numerical experiments demonstrating our method's excellent speedups and scalability up to 16384 cores for a problem with 12.6 billion unknowns.
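The cell-based SpMV and the batched SpMV with merged messages described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' PETSc-based implementation: it assumes CSR storage and, hypothetically, that the batched systems share one sparsity pattern (the same mesh connectivity) while differing in coefficients. The names `spmv`, `batched_spmv`, `pack_halo`, and `messages_per_spmv` are illustrative; no MPI calls are made, and `pack_halo` only shows how ghost values for all systems could be merged into one buffer per neighbor.

```python
from typing import List, Sequence


def spmv(row_ptr: Sequence[int], col_idx: Sequence[int],
         values: Sequence[float], x: Sequence[float]) -> List[float]:
    """CSR sparse matrix-vector product y = A x for one linear system."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def batched_spmv(row_ptr: Sequence[int], col_idx: Sequence[int],
                 batch_values: Sequence[Sequence[float]],
                 batch_x: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply B matrices that share one sparsity pattern (an assumption of this
    sketch) to B vectors in a single traversal of the shared pattern."""
    n_rows = len(row_ptr) - 1
    ys = [[0.0] * n_rows for _ in batch_values]
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]  # column index is read once and reused for every system
            for b, (vals, x) in enumerate(zip(batch_values, batch_x)):
                ys[b][i] += vals[k] * x[j]
    return ys


def pack_halo(batch_x: Sequence[Sequence[float]],
              ghost_cols: Sequence[int]) -> List[float]:
    """Gather the entries a neighbor needs from all B vectors into ONE buffer,
    so a single message per neighbor can replace B separate messages."""
    return [x[j] for x in batch_x for j in ghost_cols]


def messages_per_spmv(num_systems: int, num_neighbors: int, batched: bool) -> int:
    """Halo messages one rank sends per SpMV, assuming one message per neighbor
    per system without batching and one per neighbor with batching."""
    return num_neighbors if batched else num_systems * num_neighbors


if __name__ == "__main__":
    # A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form
    row_ptr, col_idx = [0, 2, 3, 5], [0, 2, 1, 0, 2]
    a_vals = [2.0, 1.0, 3.0, 4.0, 5.0]
    print(spmv(row_ptr, col_idx, a_vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
    print(batched_spmv(row_ptr, col_idx, [a_vals, [1.0] * 5],
                       [[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]]))  # [[3.0, 3.0, 9.0], [4.0, 2.0, 4.0]]
    print(messages_per_spmv(8, 4, batched=False), messages_per_spmv(8, 4, batched=True))  # 32 4
```

Under the one-message-per-system assumption, 8 systems and 4 neighboring ranks would need 32 halo messages per SpMV without batching versus 4 with it; the saved per-message latency is what matters once the grain size per rank becomes small.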
Free, publicly-accessible full text available June 21, 2024

Communication Optimization for Distributed Execution of Graph Neural Networks

Kurt, S.E. ; Yan, J. ; Sukumaran-Rajam, A. ; Pandey, P. ; Sadayappan, P. ( May 2023 , Proceedings IEEE International Parallel and Distributed Processing Symposium)

Free, publicly-accessible full text available May 1, 2024

TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

https://doi.org/10.1145/3572848.3577478

Xiang, Lizhi ; Yin, Miao ; Zhang, Chengming ; Sukumaran-Rajam, Aravind ; Sadayappan, P. ; Yuan, Bo ; Tao, Dingwen ( February 2023 , The 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2023))

Tucker decomposition is one of the state-of-the-art (SOTA) CNN model compression techniques. However, despite the FLOPs reduction it provides, we observe very limited inference-time reduction for Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that generates highly accurate and compact CNN models via Tucker decomposition, together with optimized inference code for GPUs. Specifically, we propose an ADMM-based training algorithm that achieves highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework that determines the proper Tucker ranks based on practical inference time (rather than FLOPs). Our evaluation of five modern CNNs on an A100 GPU demonstrates that our compressed models with our optimized code achieve up to 2.21× speedup over cuDNN, 1.12× speedup over TVM, and 3.27× speedup over the original models using cuDNN, with at most 0.05% accuracy loss.
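The FLOPs reduction mentioned above can be estimated by counting multiply-accumulates (MACs) for a Tucker-2 factorized convolution, the common form in CNN compression where the input- and output-channel modes of a K×K kernel are factorized and the layer runs as a 1×1 convolution, a small K×K core convolution, and another 1×1 convolution. Whether TDC uses exactly this form is an assumption of this sketch, and the layer shape and ranks (`r_in`, `r_out`) in the example are hypothetical. As the abstract stresses, such counts alone do not predict GPU inference time.

```python
from dataclasses import dataclass


@dataclass
class ConvShape:
    """Hypothetical convolution layer: S input channels, T output channels,
    K x K kernel, input grid H x W, output grid Ho x Wo."""
    S: int
    T: int
    K: int
    H: int
    W: int
    Ho: int
    Wo: int


def dense_macs(c: ConvShape) -> int:
    """MACs of the uncompressed K x K convolution."""
    return c.Ho * c.Wo * c.T * c.S * c.K * c.K


def tucker2_macs(c: ConvShape, r_in: int, r_out: int) -> int:
    """MACs of the Tucker-2 form: 1x1 (S->r_in), KxK core (r_in->r_out), 1x1 (r_out->T)."""
    first = c.H * c.W * c.S * r_in                  # input-channel factor on the input grid
    core = c.Ho * c.Wo * r_in * r_out * c.K * c.K   # small K x K core convolution
    last = c.Ho * c.Wo * r_out * c.T                # output-channel factor on the output grid
    return first + core + last


def dense_params(c: ConvShape) -> int:
    return c.T * c.S * c.K * c.K


def tucker2_params(c: ConvShape, r_in: int, r_out: int) -> int:
    # Factor matrices U_in (S x r_in) and U_out (T x r_out) plus core (r_out x r_in x K x K).
    return c.S * r_in + r_in * r_out * c.K * c.K + r_out * c.T


if __name__ == "__main__":
    layer = ConvShape(S=256, T=256, K=3, H=14, W=14, Ho=14, Wo=14)
    print(dense_macs(layer), tucker2_macs(layer, 64, 64))      # 115605504 13647872
    print(dense_params(layer), tucker2_params(layer, 64, 64))  # 589824 69632
```

For this hypothetical layer the ranks cut MACs by roughly 8.5×, but the work is split across three smaller kernels whose speed depends on the GPU software executing them; the abstract reports that existing libraries turned such FLOP savings into very limited inference-time gains, which is what motivates choosing ranks by measured inference time.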
Full Text Available

Effective Performance Modeling and Domain-Specific Compiler Optimization of CNNs for GPUs

https://doi.org/10.1145/3559009.3569674

Xu, Yufan ; Yuan, Qiwei ; Barton, Erik Curtis ; Li, Rui ; Sadayappan, P. ; Sukumaran-Rajam, Aravind ( October 2022 , PACT'22)

Full Text Available

Sparsity-Aware Tensor Decomposition

https://doi.org/10.1109/IPDPS53621.2022.00097

Kurt, Sureyya Emre ; Raje, Saurabh ; Sukumaran-Rajam, Aravind ; Sadayappan, P. ( May 2022 , 2022 IEEE International Parallel and Distributed Processing Symposium)

Full Text Available

Comprehensive Accelerator-Dataflow Co-design Optimization for Convolutional Neural Networks

https://doi.org/10.1109/CGO53902.2022.9741281

Vaidya, Miheer ; Sukumaran-Rajam, Aravind ; Rountev, Atanas ; Sadayappan, P. ( April 2022 , International Symposium on Code Generation and Optimization (CGO))

Full Text Available

Training of deep learning pipelines on memory-constrained GPUs via segmented fused-tiled execution

https://doi.org/10.1145/3497776.3517766

Xu, Yufan ; Raje, Saurabh ; Rountev, Atanas ; Sabin, Gerald ; Sukumaran-Rajam, Aravind ; Sadayappan, P. ( March 2022 , 31st ACM SIGPLAN International Conference on Compiler Construction)

Full Text Available

Training of Deep Learning Pipelines on Memory-Constrained GPUs via Segmented Fused-Tiled Execution

Xu, Yufan ; Raje, Saurabh ; Rountev, Atanas ; Sabin, Gerald ; Sukumaran-Rajam, Aravind ; Sadayappan, P. ( February 2022 , ACM SIGPLAN International Conference on Compiler Construction (CC))

Full Text Available

Efficient Distributed Algorithms for Convolutional Neural Networks

https://doi.org/10.1145/3409964.3461828

Li, Rui ; Xu, Yufan ; Sukumaran-Rajam, Aravind ; Rountev, Atanas ; Sadayappan, P. ( July 2021 , Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’21))

Full Text Available

IOOpt: automatic derivation of I/O complexity bounds for affine programs

https://doi.org/10.1145/3453483.3454103

Olivry, Auguste ; Iooss, Guillaume ; Tollenaere, Nicolas ; Rountev, Atanas ; Sadayappan, P. ; Rastello, Fabrice ( June 2021 , 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation)
Full Text Available
